1 Introduction

We focus on the problem of profile building in this year’s KBA track and propose two methods. The
first method is a baseline that selects the stream items containing an exact string match of the query
entity. All matched documents are assigned the same relevance score. In the second method,
we use related entities to help identify information related to the query entity.
In particular, we retrieve the Wikipedia page of each query entity and extract the anchor text of all
internal links within the page. These anchor texts are treated as the related entities of
the query entity and are used to build its profile. Given a stream item (i.e.,
a document), the relevance score is estimated by combining the match with the query entity and the
matches with the related entities. Results on the training data show that the second method is more
effective.


2  Entity  Profile  Building 

The input query set for the KBA system is a list of 29 entities from the English Wikipedia. All
the entities were manually selected by the KBA organizers; most of them are celebrities
selected from the Living_people category, and a few are organizations. Moreover, the organizers
“focused on entities with complex link graphs of relationships with other active entities”. Each
entity therefore has rich link relations with other entities in Wikipedia, which makes it possible
for us to exploit such relations to build the entity profiles.

To solve the first challenge, i.e., entity profile building, we propose a general approach: we collect
the internal links within the Wikipedia page of each query entity $e_q$ and use the anchor text of these
internal links as the related entities of $e_q$.

Given an entity, the first step is to retrieve its Wikipedia page. Fortunately, as the query set
was selected from the Wikipedia collection directly, each entity is identified by its so-called
urlname. The URL of the entity’s Wikipedia page can be constructed by simply appending the
urlname to the Wikipedia base URL (i.e., http://en.wikipedia.org/wiki/). For instance, one of
the 29 query entities is “Basic_Element_(music_group)”, and its Wikipedia URL is
“http://en.wikipedia.org/wiki/Basic_Element_(music_group)”. Instead of retrieving the HTML
Wikipedia page directly, we use the API provided by Wikipedia to dump the raw content in JSON
format. Given a query entity $e_q$, the API accepts the urlname as input and returns the English Wikipedia
page $\mathrm{wiki}(e_q)$ in Wiki markup¹ format. We parse Wiki markup instead of HTML because
we find it much easier to identify and extract the internal links.

¹ Wiki markup: http://en.wikipedia.org/wiki/Help:Wiki_markup
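
The URL construction and API request described above can be sketched as follows. The paper does not say which API call was used; the `api.php` endpoint and the `action=query`, `prop=revisions`, `rvprop=content` parameters are one standard MediaWiki way to obtain raw Wiki markup as JSON and are an assumption here, as are the helper names `page_url` and `api_request_url`.

```python
from urllib.parse import quote, urlencode

WIKI_BASE = "http://en.wikipedia.org/wiki/"
# Assumed endpoint: the paper only says "the API provided by Wikipedia".
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def page_url(urlname: str) -> str:
    """Append the urlname to the Wikipedia base URL."""
    # Keep underscores and parentheses readable, as in the paper's example.
    return WIKI_BASE + quote(urlname, safe="_()")


def api_request_url(urlname: str) -> str:
    """Build a request for the raw Wiki markup of a page, returned as JSON."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
        "titles": urlname,
    }
    return API_ENDPOINT + "?" + urlencode(params)


print(page_url("Basic_Element_(music_group)"))
print(api_request_url("Basic_Element_(music_group)"))
```

The sketch only builds the request URLs; fetching and decoding the JSON response is left out so that the example runs without network access.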
With the retrieved Wikipedia page, we then apply a Python-based parser to parse the Wiki markup
document and extract the related entities. A Wikipedia page contains two types of links:
internal links and external links. The former connect entities within Wikipedia, and the latter
provide supplemental information about the entity. An internal link between two entities
indicates that there is some relation between them. Therefore, by following the internal links from the
query entity $e_q$, we obtain a list of entities that can be treated as related entities. More specifically,
an internal link in a Wiki markup document is denoted as shown in the following example:

'''Basic Element''' is a [[Sweden|Swedish]] [[eurodance]] [[hip-hop]] group formed in 1993.

There are three internal links in the example above, each of which is enclosed in double square
brackets. The first link, [[Sweden|Swedish]], contains two parts separated by a vertical bar. The first
part is the urlname, and the second part is the anchor text of the link shown on the rendered HTML
page. This link is therefore rendered as a link to http://en.wikipedia.org/wiki/Sweden
with the anchor text “Swedish”. The other two links, [[eurodance]] and [[hip-hop]], have only one
part, and the text within the double square brackets serves as both the urlname and the anchor text.
We can thus extract three related entities from the markup text above, i.e., Sweden, eurodance
and hip-hop. Formally, given a Wikipedia page $\mathrm{wiki}(e_q)$, we extract a set of related entities
$\mathrm{rel}(e_q) = \{e \mid e \in E(\mathrm{wiki}(e_q))\}$, where $E(\mathrm{wiki}(e_q))$ denotes all the Wikipedia entities in $\mathrm{wiki}(e_q)$.
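
The link extraction just described can be sketched with a regular expression. This is an illustration only, not the parser used for the submitted runs: the pattern, the function name `extract_internal_links`, and the decision to skip namespaced targets such as `File:` or `Category:` are assumptions.

```python
import re

# Matches [[target]] and [[target|anchor]]. External links use single
# brackets in Wiki markup and are therefore not matched.
INTERNAL_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_internal_links(markup: str) -> list[tuple[str, str]]:
    """Return (urlname, anchor text) pairs in order of first appearance."""
    links = []
    seen = set()
    for match in INTERNAL_LINK.finditer(markup):
        target = match.group(1).strip()
        # A one-part link uses the target as its anchor text.
        anchor = (match.group(2) or target).strip()
        # Assumption: namespaced targets (File:, Category:, ...) are not entities.
        if ":" in target or target in seen:
            continue
        seen.add(target)
        links.append((target, anchor))
    return links


markup = (
    "'''Basic Element''' is a [[Sweden|Swedish]] [[eurodance]] "
    "[[hip-hop]] group formed in 1993."
)
related = [target for target, _ in extract_internal_links(markup)]
print(related)  # ['Sweden', 'eurodance', 'hip-hop']
```

Both parts of each link are returned so that either the urlname or the anchor text can serve as the surface form of a related entity.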

With the set of related entities, we define the entity profile $\mathrm{profile}(e_q)$ as follows:

$$\mathrm{profile}(e_q) = \{e_q, \mathrm{rel}(e_q)\}. \tag{1}$$


3  Entity  Profile  based  Stream  Filtering 

We now discuss how to use the entity profile $\mathrm{profile}(e_q)$ for stream filtering. Given a stream
document $d$, we need to estimate how likely the document is to be relevant to the query entity $e_q$. As shown
in Equation (1), an entity profile consists of the entity itself, $e_q$, and its related entities
$\mathrm{rel}(e_q)$; the relevance of the document should therefore be determined by both. To
reflect this, we propose the following method to estimate the relevance between $d$ and $\mathrm{profile}(e_q)$:

$$\mathrm{score}(d, e_q) = \alpha \cdot \mathrm{mention}(d, e_q) + \beta \cdot \sum_{e \in \mathrm{rel}(e_q)} \mathrm{occ}(d, e), \tag{2}$$


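where $\mathrm{mention}(d, e_q)$ is an indicator function that identifies whether the document $d$ mentions $e_q$. It is defined as:

$$\mathrm{mention}(d, e_q) = \begin{cases} 1 & \text{if } d \text{ mentions } e_q, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$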


Moreover, $\mathrm{occ}(d, e)$ denotes the number of occurrences of $e$ in $d$, and $\alpha$ and $\beta$ are coefficients that
weight the two score components to balance their influence. The main idea behind Equation (2)
is to capture whether the document mentions $e_q$ as well as any related entities in
$\mathrm{rel}(e_q)$. The first component, $\alpha \cdot \mathrm{mention}(d, e_q)$, checks whether $e_q$ is discussed in $d$, and the second
component, $\beta \cdot \sum_{e \in \mathrm{rel}(e_q)} \mathrm{occ}(d, e)$, provides complementary information
under the assumption that the more related entities occur in $d$, the more likely $d$ is relevant to $e_q$.
Since the first component is the main part of the relevance score, $\alpha$ should be much larger than
$\beta$.
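
A minimal sketch of Equations (2) and (3) follows. Mentions and occurrences are matched here by case-sensitive substring search, which is an assumption: the paper specifies exact string match for the query entity but does not give the matching details. The function names and the two sample documents are illustrative.

```python
def mention(document: str, entity: str) -> int:
    """Equation (3): 1 if the document contains the entity string, else 0."""
    return 1 if entity in document else 0


def occ(document: str, entity: str) -> int:
    """Number of non-overlapping occurrences of the entity string."""
    return document.count(entity) if entity else 0


def score(document, query_entity, related, alpha, beta):
    """Equation (2): weighted mention plus weighted related-entity counts."""
    related_total = sum(occ(document, e) for e in related)
    return alpha * mention(document, query_entity) + beta * related_total


related = ["Sweden", "eurodance", "hip-hop"]
doc_a = ("Basic Element is a eurodance group from Sweden; "
         "eurodance was popular in the 1990s.")
doc_b = "Sweden has a lively eurodance scene."

print(score(doc_a, "Basic Element", related, alpha=100, beta=1))  # 103
print(score(doc_b, "Basic Element", related, alpha=100, beta=1))  # 2
```

With $\alpha$ much larger than $\beta$, a document that mentions the query entity always outscores one that only mentions a handful of related entities, which is the behavior the weighting is meant to produce.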

With the relevance score calculated, we set a threshold $T$ to determine whether the document
is relevant to the query entity $e_q$. Stream documents with a relevance score above $T$ are
kept, and the others are discarded.
Evaluation Set       Central            Central+Relevant
Run                  maxF     maxSU     maxF     maxSU
all-mean             0.220    0.311     0.404    0.498
all-median           0.289    0.333     0.553    0.554
UDInfo-KBA_EX        0.297    0.154     0.605    0.580
UDInfo-KBA_WIKI1     0.342    0.331     0.639    0.611
UDInfo-KBA_WIKI2     0.354    0.331     0.636    0.617
UDInfo-KBA_WIKI3     0.355    0.331     0.597    0.592

Table 1: Results of official runs. The maxF and maxSU measures are reported by the official evaluation
program, which collects the maximum F1 and SU for each topic at a certain relevance score cutoff and
then reports the average. all-mean and all-median are, respectively, the mean and median of the results
aggregated from all the runs submitted to this year’s KBA track.
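
To make the maxF description in the caption concrete, the sketch below follows a literal reading of it: for each topic, take the best F1 over a set of score cutoffs, then average over topics. It is not the official evaluation program; the data layout, the function names, and the choice to treat a document as retrieved when its score is at least the cutoff are assumptions. SU is omitted because the paper does not define it.

```python
def f1_at_cutoff(scored, relevant, cutoff):
    """F1 of the documents whose score is at least `cutoff`."""
    retrieved = {doc for doc, s in scored.items() if s >= cutoff}
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


def max_f(runs_by_topic, truth_by_topic, cutoffs):
    """Average over topics of the best F1 found across the cutoffs."""
    best = [
        max(f1_at_cutoff(runs_by_topic[t], truth_by_topic[t], c) for c in cutoffs)
        for t in truth_by_topic
    ]
    return sum(best) / len(best)


# Hypothetical toy data: two topics, document id -> relevance score.
runs = {
    "topic_a": {"d1": 103, "d2": 100, "d3": 2},
    "topic_b": {"d4": 102, "d5": 100},
}
truth = {"topic_a": {"d1", "d2"}, "topic_b": {"d5"}}

print(max_f(runs, truth, cutoffs=[0, 50, 101]))
```

In the toy data, topic_a reaches an F1 of 1.0 at cutoff 50 and topic_b peaks at 2/3 at the lower cutoffs, so the average is about 0.83.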


4  Experiment  Results 

4.1  Submitted  Runs 

We submitted four official runs to the KBA track. They differ in the values of $\alpha$ and $\beta$ in
Equation (2) and in the filtering threshold $T$. The runs are summarized as follows.

1. UDInfo-KBA_EX: $\alpha = 1000$, $\beta = 0$ and $T = 0$. This is a special case in which Equation (2)
reduces to $\mathrm{score}(d, e_q) = 1000 \cdot \mathrm{mention}(d, e_q)$, which means the relevance score is based
only on whether there is an exact match of the query entity $e_q$ in document $d$. If there is an exact match,
the relevance score is 1000; otherwise, it is 0. This run serves as a baseline.

2. UDInfo-KBA_WIKI1: $\alpha = 100$, $\beta = 1$ and $T = 101$. The main idea of this method is that,
besides the exact match, we also want to capture the matches of the related entities. $T$ is set
empirically based on the results on the training data.

3. UDInfo-KBA_WIKI2: $\alpha = 100$, $\beta = 1$ and $T = 102$.

4. UDInfo-KBA_WIKI3: $\alpha = 100$, $\beta = 1$ and $T = 103$.
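
The effect of the four parameter settings can be checked with a few lines of arithmetic. The sketch below reads "above T" from Section 3 as a strict inequality, which is an assumption; the dictionary `RUNS` and the helper `keep` are illustrative names.

```python
# Run name -> (alpha, beta, T), as listed above.
RUNS = {
    "UDInfo-KBA_EX": (1000, 0, 0),
    "UDInfo-KBA_WIKI1": (100, 1, 101),
    "UDInfo-KBA_WIKI2": (100, 1, 102),
    "UDInfo-KBA_WIKI3": (100, 1, 103),
}


def keep(run: str, mentions_query: bool, related_occurrences: int) -> bool:
    """Apply Equation (2) and the run's threshold to one document."""
    alpha, beta, threshold = RUNS[run]
    value = alpha * (1 if mentions_query else 0) + beta * related_occurrences
    # Assumption: "above T" means strictly greater than T.
    return value > threshold


for run in RUNS:
    # Smallest number of related-entity occurrences that keeps a document
    # which also mentions the query entity.
    needed = next(n for n in range(1000) if keep(run, True, n))
    print(run, needed)
```

Under the strict reading, the baseline keeps every document that mentions the query entity, while the WIKI1, WIKI2 and WIKI3 runs additionally require at least 2, 3 and 4 related-entity occurrences, respectively. A document that does not mention the query entity is kept only if its related-entity count alone exceeds $T$.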

4.2  Results  Analysis 

The results of all the runs are summarized in Table 1. All four of our runs achieve a higher maxF
than both all-mean and all-median on both evaluation sets. Moreover, incorporating the match of the
related entities into the relevance score improves maxF over the exact-match baseline in every case
except UDInfo-KBA_WIKI3 on Central+Relevant, which shows the effectiveness of related entities in
entity profile based filtering.

To better understand per-query performance, we plot the maxF of each query on both central
and central+relevant, as shown in Figure 1 and Figure 2, respectively.
In general, our entity profile based filtering method outperforms both the baseline
and all-mean on most queries.


5  Conclusion 

This is the first year of the KBA track, and the task is relatively simple, as the organizers want to get a
first impression of how the data fits the task of knowledge base acceleration. We propose an entity
profile based filtering framework and derive two methods to solve this year’s task. Experimental results
on the testing data show that our methods are effective at selecting relevant documents. We find
that incorporating related entities into the entity profile improves performance, showing
that related entities are important for finding relevant documents.
Figure 1: maxF of each topic on the testing data (central).

Figure 2: maxF of each topic on the testing data (central+relevant).


